Introduction

This investigation aims to characterise what makes a gene energetically efficient to synthesise, using gene-centric data gathered from several high-throughput studies of the fission yeast Schizosaccharomyces pombe (Bitton et al. 2015; Marguerat et al. 2012; Hasan et al. 2014; Christiano et al. 2014; Matsuyama et al. 2006; Kim et al. 2010). To investigate energy efficiency, the term must first be defined and the factors that contribute to it identified. The factors assessed include NumberIntrons, NumResidues, mRNA_copies_per_cell, protein_copies_per_cell, mRNA.stabilities, and whether the gene is essential. This report was developed using Rmarkdown (Allaire et al. 2018), with importing and analysis conducted in R (R Core Team 2022), using tidyverse packages (Wickham et al. 2019) to clean the data.

Methods

Data Description

The data can be downloaded using the following link: fission_yeast_data.2018-11-21.Rda

Most of the data set was sourced from the AnGeLi website: http://bahlerweb.cs.ucl.ac.uk/cgi-bin/GLA/GLA_input. The data set also includes data from Grech et al. (2019). The data columns are described in Table 1.

Table 1: Description of data columns. *Q = quantitative, C = categorical.
| Column name | Description | Data type* (Q/C) | Notes | Reference |
|---|---|---|---|---|
| NumberIntrons | Number of introns in the gene. | Q | | (Wood et al. 2002) |
| NumResidues | Number of amino acid residues in the protein. | Q | Will be NA for ncRNAs. | (Wood et al. 2002) |
| protein_coding | Is the gene protein-coding? | C | 1 = protein-coding, 0 = ncRNA | (Wood et al. 2002) |
| ncRNA | Is the gene a non-canonical ncRNA? | C | 1 = ncRNA, 0 = either protein-coding or a canonical ncRNA (rRNA, tRNA, snoRNA) | (Wood et al. 2002) |
| Rel_telomere | The relative distance to the telomere. | Q | 0 = at the telomere, 1 = in the middle of the chromosome | (Wood et al. 2002) |
| mRNA_copies_per_cell | The number of mRNA copies per cell, from proliferating (growing) cells. | Q | | (Marguerat et al. 2012) |
| protein_copies_per_cell | The number of protein copies per cell, from proliferating (growing) cells. | Q | | (Marguerat et al. 2012) |
| mRNA.stabilities | The half-life of the mRNA in minutes. | Q | | (Hasan et al. 2014) |
| GeneticDiversity | The level of genetic diversity within S. pombe strains. | Q | The average pairwise similarity (π) | (Jeffares et al. 2015) |
| ProteinHalfLife | The half-life of the protein in minutes. | Q | | (Christiano et al. 2014) |
| Golgi, Mitochondrion, Nuclear_dots, Nuclear_envelope, Nucleolus, Nucleus, Vacuole | Where the protein is located. | C | 1 = in this location, 0 = not in this location | (Matsuyama et al. 2006) |
| essential | Is this an ‘essential’ gene (required for cell survival)? | C | | (Kim et al. 2010) |
| chromosome | Which chromosome the gene is on. | C | S. pombe has three chromosomes (I, II, III) and a mitochondrial genome (MT). | (Wood et al. 2002) |
| start, end | The position of the gene in the chromosome. | Q | Gene length = end - start + 1 | (Wood et al. 2002) |
| solid.media.KO.fitness | The colony size of a strain with this gene knocked out (on agar medium). | Q | The colony size is a proxy for knockout ‘fitness’. | (Malecki et al. 2016) |
| gene.expression.RPKM | The RNA expression level from RNA-seq, from proliferating (growing) cells. | Q | | (Atkinson et al. 2018) |
| conservation.phyloP | The level of conservation in this gene. | Q | How fast the gene has changed over time. | (Grech et al. 2019) |

Analysis

Identifying Energy Efficient mRNA Transcripts

Spending a large amount of energy on producing many mRNA transcripts that last only a short time may not be the best use of the cell’s energy. To assess whether any mRNA transcripts were notably energy-efficient or inefficient, cut-off values were first calculated for each variable: a value was treated as extreme if it fell outside the interval bounded by the 2.5th and 97.5th percentiles. These bounds are shown in Table 2, and the resulting classification is visualised in Figure 1.

Table 2: Upper and lower bounds for values of significance
| | Lower.Percentile | Upper.Percentile |
|---|---|---|
| mRNA copies | 0.08800 | 67.0000 |
| mRNA Half-Life (mins) | 15.78272 | 100.5964 |
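
As an illustration of how such bounds can be derived, the following is a minimal Python sketch using only the standard library. It is not the R code used for this report; the function name `percentile_bounds` and the toy data are hypothetical. The "inclusive" method interpolates linearly between data points, which should correspond to the default behaviour of R's quantile().

```python
import statistics

def percentile_bounds(values, lower=2.5, upper=97.5):
    """Return the (lower, upper) percentile cut-offs of a list of numbers."""
    # quantiles(n=1000) returns 999 cut points spaced at 0.1% steps,
    # so the p-th percentile sits at index round(p * 10) - 1.
    cuts = statistics.quantiles(values, n=1000, method="inclusive")
    return cuts[round(lower * 10) - 1], cuts[round(upper * 10) - 1]

# Hypothetical toy data: the integers 0..1000 stand in for a real column
# such as mRNA_copies_per_cell.
toy = list(range(1001))
low, high = percentile_bounds(toy)
print(low, high)  # 25.0 975.0
```

Values below `low` or above `high` would then be flagged as extreme for that variable.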

To determine the efficiency score of the genes for these variables, the following categories were made:

  • Genes that produce few transcripts and have long mRNA half-lives are deemed Very Efficient. As mRNA production requires a large amount of energy, producing few transcripts significantly frees up energy stores for other energy-intensive activities. Furthermore, as the transcript has a long half-life, the production of more transcripts becomes less of a necessity.
  • Genes that produce a standard number of transcripts with long mRNA half-lives, or few transcripts with normal mRNA half-lives, are deemed Efficient.
  • Genes that produce a standard number of mRNA transcripts with normal mRNA half-lives are deemed Standard.
  • Genes that produce many mRNA transcripts with long mRNA half-lives, or few transcripts with short mRNA half-lives, are also deemed Standard, as one extreme offsets the other.
  • Genes that produce a standard number of mRNA transcripts with short mRNA half-lives, or many mRNA transcripts with normal mRNA half-lives, are deemed Inefficient.
  • Genes with short mRNA half-lives and that produce many transcripts are deemed Very Inefficient.
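
The categories above can be expressed compactly as a score. The following Python sketch is one possible encoding, not the code used in this report: each variable contributes +1 (efficient extreme), 0 (within the interval) or -1 (inefficient extreme), and the sum maps to a label. The function name `classify` is hypothetical, the cut-offs are those in Table 2, and treating values exactly on a bound as non-extreme is an assumption.

```python
# Cut-offs taken from Table 2 (2.5th and 97.5th percentiles).
COPIES_LOW, COPIES_HIGH = 0.088, 67.0
HALF_LIFE_LOW, HALF_LIFE_HIGH = 15.78272, 100.5964

LABELS = {
    2: "Very Efficient",
    1: "Efficient",
    0: "Standard",
    -1: "Inefficient",
    -2: "Very Inefficient",
}

def classify(mrna_copies, half_life):
    """Assign an efficiency label from copy number and half-life (minutes)."""
    # Few transcripts are cheap (+1); many transcripts are costly (-1).
    if mrna_copies < COPIES_LOW:
        copies_score = 1
    elif mrna_copies > COPIES_HIGH:
        copies_score = -1
    else:
        copies_score = 0
    # Long-lived transcripts are efficient (+1); short-lived ones are not (-1).
    if half_life > HALF_LIFE_HIGH:
        stability_score = 1
    elif half_life < HALF_LIFE_LOW:
        stability_score = -1
    else:
        stability_score = 0
    return LABELS[copies_score + stability_score]

# Values reported for SPAC869.09 in Table 3.
print(classify(0.077, 217.192))  # Very Efficient
```

Because opposite extremes cancel to zero, a gene with many long-lived transcripts, or few short-lived ones, falls into Standard, matching the fourth category.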

Figure 1: Interactive plot showing the mRNA half-life plotted against the number of mRNA transcripts per cell, with both variables being on a base-10 logarithmic scale. Values represent genes classified as either ‘Very Efficient’, ‘Efficient’, ‘Standard’, or ‘Inefficient’ dependent on their mRNA stability and mRNA copy number. The interval of significance for mRNA stability is x < 15.8, x > 101. The interval of significance for the number of mRNA transcripts is x < 0.088, x > 67.

Using Figure 1, we can identify 1 mRNA transcript classified as Very Efficient, SPAC869.09, as well as 69 Efficient mRNA transcripts. No transcripts are classified as Very Inefficient, suggesting that S. pombe avoids producing transcripts that are completely inefficient. However, 180 transcripts are classified as Inefficient, meaning 3.87% of S. pombe’s transcripts may not be using the cell’s energy stores as efficiently as possible.

Figure 1 appears to show a positive correlation between mRNA.stabilities and mRNA_copies_per_cell, with more abundant transcripts tending to have longer half-lives. A Pearson’s correlation test (Benesty et al. 2009) gives a correlation coefficient of 0.469 (3 s.f.) with a p-value of 3.6142428 × 10^-253, indicating a statistically significant, moderate positive correlation between the two variables. This suggests that genes that draw more heavily on the cell’s energy stores during transcription tend to produce more stable transcripts. It can be hypothesised that transcripts that are costly to produce have become more stable so that the cell’s investment in them is not wasted. Furthermore, studies have shown that mRNA transcripts can be stored rather than immediately translated (Shyu, Wilkinson, and Hoof 2008); such transcripts would require a longer half-life to remain available for later translation. Their accumulation would in turn increase mRNA_copies_per_cell.
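
For readers unfamiliar with the statistic, a minimal Python sketch of Pearson’s correlation coefficient is given below. It is illustrative only: the report’s value was computed in R, the function name `pearson_r` is hypothetical, and the toy vectors are invented rather than taken from the data set.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical toy values, not taken from the data set.
copies = [1, 2, 3, 4, 5]
half_life = [2, 1, 4, 3, 5]
print(round(pearson_r(copies, half_life), 3))  # 0.8
```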

As SPAC869.09 has been identified as the most energy-efficient transcript, it can be used as a benchmark for identifying the key characteristics of an energy-efficient gene, seen in Table 3:

Table 3: Properties of SPAC869.09
| Property | Value (row 1554) |
|---|---|
| gene | SPAC869.09 |
| NumberIntrons | 0 |
| NumResidues | 116 |
| Rel_telomere | 0.0451902793736816 |
| mRNA_copies_per_cell | 0.077 |
| protein_copies_per_cell | 2775.06 |
| mRNA.stabilities | 217.192 |
| GeneticDiversity | 0 |
| ProteinHalfLife | NA |
| Golgi | 0 |
| Mitochondrion | 0 |
| Nuclear_dots | 0 |
| Nuclear_envelope | 0 |
| Nucleolus | 0 |
| Nucleus | 0 |
| Vacuole | 0 |
| essential | 0 |
| solid.media.KO.fitness | 1.097457 |
| gene.expression.RPKM | 0.2405 |
| conservation.phyloP | 0 |
| chromosome | I |
| gene_length | 605 |
| Efficiency | Very Efficient |

Genes Lacking Introns Have Shorter mRNA Half-Lives

SPAC869.09 does not have any introns, raising the question of whether this has any correlation with the number of mRNA copies produced and their stability.

51.1% of S. pombe’s genes lack introns, so SPAC869.09’s lack of introns is not in itself remarkable. However, it highlights that roughly half of S. pombe’s genes give rise to intronless transcripts.

Figure 2: Violin plots illustrating the distribution of both the number of mRNA transcripts and the mRNA half-life in minutes of those transcripts on a base-10 logarithmic scale for genes containing introns and lacking introns.

Figure 2 does not show a visually clear difference in either mRNA.stabilities or mRNA_copies_per_cell between genes that have introns and genes that lack them. A Wilcoxon rank-sum test (Wilcoxon 1992) on mRNA_copies_per_cell gives a p-value of 0.009 (3 d.p.) with a test statistic W of 3.343824 × 10^6, showing that there is a statistically significant difference in transcript frequency. However, the magnitude of the difference is small (0.2), so it is treated as negligible in practice.
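
To make the test statistic concrete, the sketch below computes W in the sense reported by R (the rank sum of the first group minus its minimum possible value) together with a two-sided p-value from a plain normal approximation. This is a simplified illustration, not the report’s code: it omits the continuity and tie corrections that R applies, the function names are hypothetical, and the toy half-lives are invented.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def rank_sum_test(x, y):
    """Return (W, two-sided p) using an uncorrected normal approximation."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_w = n1 * n2 / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))
    return w, p

# Hypothetical half-lives in minutes, not taken from the data set.
with_introns = [33.4, 40.1, 29.0, 51.2]
without_introns = [28.9, 25.3, 31.0, 22.7]
w, p = rank_sum_test(with_introns, without_introns)
print(w)  # 15.0
```

With samples this small an exact test would normally be preferred; the approximation is shown only because it is the regime that applies to several thousand genes.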

Testing mRNA.stabilities between genes that have introns and genes that lack introns gives a p-value of 6.8314915 × 10^-27 with a test statistic W of 2.221964 × 10^6, showing that there is a significant difference in half-lives. The median mRNA half-lives of S. pombe genes that lack introns and genes that have introns are 28.85 and 33.36 minutes (2 d.p.), respectively. As the presence of introns appears to increase mRNA stability, it could be hypothesised that spliceosome assembly on the transcript increases its stability. This is supported by Lu and Cullen (2003), who found that the absence of introns resulted in substantially less stable mRNA transcripts, and by Wang et al. (2007), who “found that human intron-containing genes have more stable mRNAs than intron-less genes”. Assembling the spliceosome consumes a significant amount of energy, as many proteins must be synthesised to build it, and identifying and removing introns consumes more. Furthermore, synthesising introns only for them to be spliced out and degraded can be regarded as a waste of energy (albeit one that is important for regulatory purposes), so the presence of introns appears to work against a gene’s energy efficiency. However, it can be argued that introns allow for alternative splicing, a process that enables a single gene to code for multiple proteins. Without introns, the cell would have to maintain more genes and carry out more transcription, requiring more energy.

More Introns Lead to Greater mRNA Stability

Visualised in Figure 3 is an assessment of whether the number of introns present in a gene affects its mRNA stability.

Figure 3: Interactive box plots showing the distribution of mRNA half-lives in minutes on a base-10 logarithmic scale for genes containing differing number of introns.

At first glance, the visualisation suggests a slight negative correlation between mRNA.stabilities and NumberIntrons, but this impression is driven by the decrease in outliers as the number of introns rises. The overall distribution of the data at each intron count instead shows a slight increase in mRNA.stabilities with NumberIntrons. For testing, the number of introns is treated as a ranked variable so that Spearman’s rank correlation test can be applied (Spearman 1961). This reveals a significant positive correlation between mRNA.stabilities and NumberIntrons, with a p-value of 2.0417308 × 10^-26. However, the correlation is weak, with a sample estimate of the correlation coefficient of 0.155 (3 s.f.), so it is not certain that mRNA stability is affected by the assembly of spliceosomes.
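
Spearman’s coefficient is Pearson’s coefficient applied to ranks, which is why a count such as NumberIntrons, with many tied values, can be handled by giving ties their average rank. The Python sketch below illustrates this; it is not the report’s R code, the function names are hypothetical, and the toy values are invented.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    ss_x = sum((a - mean_x) ** 2 for a in rx)
    ss_y = sum((b - mean_y) ** 2 for b in ry)
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical intron counts and half-lives (minutes), not from the data set.
introns = [0, 0, 1, 2, 3, 5]
half_life = [20, 25, 30, 28, 40, 45]
print(round(spearman_rho(introns, half_life), 3))  # 0.928
```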

This is assessed further by testing for significant differences in NumberIntrons between Efficient, Standard and Inefficient genes.

A Kruskal-Wallis test (Kruskal and Wallis 1952) gives a p-value of 1.0507453 × 10^-11, revealing that there is a significant difference between the number of introns present in Efficient, Standard and Inefficient genes. A post-hoc Wilcoxon rank-sum test produces Table 4, with Table 5 displaying the median values.

Table 4: Results of the post-hoc test following the Kruskal-Wallis test. 0 represents the value 6.8e-12, which is displayed as 0 due to rounding.
| | Efficient | Inefficient |
|---|---|---|
| Inefficient | 1e-07 | NA |
| Standard | 1e+00 | 0 |

Table 5: Median number of introns in Efficient, Inefficient and Standard genes.
| Efficiency | median_NumberIntrons |
|---|---|
| Efficient | 1 |
| Inefficient | 0 |
| Standard | 1 |

This reveals that Efficient and Standard genes have a median of 1 intron, whereas Inefficient genes have a median of 0. This finding is consistent with Figure 2, where genes that lack introns were found to have shorter half-lives, further suggesting that lacking introns makes a gene less efficient.

Are Non-Essential Genes More Efficient?

SPAC869.09 is not an essential gene, raising the question of whether there is a correlation between the Efficiency of a gene and if it is essential or not.

Table 6: Distribution of essential and non-essential genes across efficiency grades. The Very Efficient gene, SPAC869.09, is counted as an Efficient gene in the table.
| | Efficient | Standard | Inefficient |
|---|---|---|---|
| Essential | 11 | 1397 | 48 |
| Non-Essential | 59 | 3008 | 132 |

As essential and Efficiency are both categorical variables, a Pearson’s Chi-squared test can be performed (Pearson 1900). This gives a p-value of 0.007 (3 d.p.), suggesting that there is a significant association between the two variables. This is visualised in Figure 4 using the ggbarstats() function from the ggstatsplot package (Patil 2021).
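
The reported p-value can be checked by hand from the counts in Table 6. The following Python sketch is an illustration rather than the report’s R code, and the function name `chi_squared` is hypothetical. It computes the Pearson Chi-squared statistic from the expected counts and uses the fact that, for exactly 2 degrees of freedom, the upper-tail probability of the Chi-squared distribution is exp(-x/2).

```python
import math

# Observed counts from Table 6 (columns: Efficient, Standard, Inefficient).
observed = {
    "Essential": [11, 1397, 48],
    "Non-Essential": [59, 3008, 132],
}

def chi_squared(table):
    """Return (statistic, degrees of freedom) for a contingency table."""
    rows = list(table.values())
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    total = sum(row_totals)
    stat = 0.0
    for r, row in enumerate(rows):
        for c, obs in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[r] * col_totals[c] / total
            stat += (obs - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return stat, dof

stat, dof = chi_squared(observed)
# Closed form valid only when dof == 2.
p_value = math.exp(-stat / 2)
print(round(stat, 2), dof, round(p_value, 3))  # 10.06 2 0.007
```

The largest contribution to the statistic comes from the Essential/Efficient cell, where 11 genes are observed against roughly 22 expected under independence.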

Figure 4: This figure shows the association between the Efficiency and Essential variables, alongside the following test statistics: chi-squared estimate, chi-squared p-value and Cramer’s V. P-values for each efficiency group are also displayed. The Cramer’s V value can be ignored as this test requires both categorical variables to have more than two levels. There are only two levels present for the Essential variable.

From Figure 4, we can see that non-essential genes are significantly over-represented amongst the Efficient genes, suggesting that non-essential genes tend to be more efficient. As essential genes are required for the cell’s survival, it can be hypothesised that they had little scope to evolve towards efficiency, having become fixed in the genome because of their importance to S. pombe’s survival. In contrast, non-essential genes did not need to be fixed, allowing a greater degree of adaptability that could make them more efficient. As a result, non-essential genes may have become more efficient than their essential counterparts.

mRNA Half-Life Positively Correlates with Protein Copies Produced

It can be predicted that the correlation seen between mRNA_copies_per_cell and mRNA.stabilities will carry over to a correlation with protein_copies_per_cell. To assess this, we can visualise the relationship by recreating Figure 1 with colour representing the number of protein copies produced from each gene, rather than the gene’s efficiency. This can be seen in Figure 5.

Figure 5: mRNA half-life plotted against the number of mRNA transcripts per cell, with both variables being on a base-10 logarithmic scale. Colour represents how many protein copies a gene has produced in the cell on a base-10 logarithmic scale. Each logarithmic value was then rounded to the nearest 0.5, where they were converted into ordinal variables. A strong positive correlation can be found between the number of mRNA transcripts produced and the number of protein copies produced.
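
The colour binning described in the caption can be sketched as follows. This is an illustrative Python version, not the R code behind the figure, and the function name `protein_bin` is hypothetical. Note that Python’s round() sends exact halves to the nearest even integer, which may differ slightly from other rounding rules at bin boundaries.

```python
import math

def protein_bin(protein_copies):
    """Round log10(protein copies) to the nearest 0.5 for use as an ordinal bin."""
    # Doubling, rounding to an integer, then halving snaps to a 0.5 grid.
    return round(math.log10(protein_copies) * 2) / 2

# protein_copies_per_cell reported for SPAC869.09 in Table 3.
print(protein_bin(2775.06))  # 3.5
```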

To test for correlations between the three variables, a Pearson’s correlation test can be performed. The results of this test can be seen in Table 7:

Table 7: Pearson’s correlation coefficients for mRNA half-life, mRNA transcript frequency and protein frequency.
| | mRNA.stabilities | mRNA_copies_per_cell | protein_copies_per_cell |
|---|---|---|---|
| mRNA.stabilities | 1.0000000 | 0.4535368 | 0.3768907 |
| mRNA_copies_per_cell | 0.4535368 | 1.0000000 | 0.8189707 |
| protein_copies_per_cell | 0.3768907 | 0.8189707 | 1.0000000 |

A correlation coefficient of 0.819 (3 s.f.) between protein_copies_per_cell and mRNA_copies_per_cell indicates a strong positive correlation between the two. This is expected, as each additional transcript is likely to be translated into additional protein. A correlation coefficient of 0.377 (3 s.f.) between mRNA.stabilities and protein_copies_per_cell indicates a moderate positive correlation between the two variables, possibly due to the activity of micro-RNAs (miRNAs). The number of protein copies produced from a gene may fall as mRNA half-life falls because of multiple miRNA target sites in the gene (Shyu, Wilkinson, and Hoof 2008). If a gene has multiple miRNA target sites, it is more likely that miRNAs will bind to and degrade the mRNA, reducing its half-life and interfering with expression of the protein. This would reduce the number of protein copies of that gene in the cell, so the correlation may be driven by the presence of miRNA target sites.

Conclusion

In conclusion, mRNA stability appears to be the characteristic that most strongly indicates the energy efficiency of a gene in S. pombe during transcription. mRNA stability correlates positively with transcript abundance and with the number of introns present in a gene, the latter being supported by previous studies. Further analysis of the energy efficiency of genes during transcription would involve assessing the localisation of the encoded proteins and investigating whether certain locations harbour genes of greater energy efficiency. A brief analysis of the energy efficiency of a gene during translation provided evidence for a correlation between the number of protein copies present and its mRNA stability. This may be due to the presence of miRNA target sites; however, further analysis would be required to assess how likely this is. Future analysis of the energy efficiency of genes during translation could focus on protein half-life and its relationship with various factors.

Word count

This document: 1238
README: 160
Total: 1398

References

Allaire, J, Yihui Xie, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, Hadley Wickham, Joe Cheng, Winston Chang, and Richard Iannone. 2018. “Rmarkdown: Dynamic Documents for R.” R Package Version 1 (11).
Benesty, Jacob, Jingdong Chen, Yiteng Huang, and Israel Cohen. 2009. “Pearson Correlation Coefficient.” In Noise Reduction in Speech Processing, 37–40. Springer.
Bitton, Danny A, Falk Schubert, Shoumit Dey, Michal Okoniewski, Graeme C Smith, Sanjay Khadayate, Vera Pancaldi, Valerie Wood, and Jürg Bähler. 2015. “AnGeLi: A Tool for the Analysis of Gene Lists from Fission Yeast.” Frontiers in Genetics.
Christiano, Romain, Nagarjuna Nagaraj, Florian Fröhlich, and Tobias C Walther. 2014. “Global Proteome Turnover Analyses of the Yeasts S. cerevisiae and S. pombe.” Cell Reports.
Grech, Leanne, Daniel C Jeffares, Christoph Y Sadée, Marı́a Rodrı́guez-López, Danny A Bitton, Mimoza Hoti, Carolina Biagosch, et al. 2019. “Fitness Landscape of the Fission Yeast Genome.” Mol. Biol. Evol. 36 (8): 1612–23.
Hasan, Ayesha, Cristina Cotobal, Caia D S Duncan, and Juan Mata. 2014. “Systematic Analysis of the Role of RNA-binding Proteins in the Regulation of RNA Stability.” PLoS Genet. 10 (11): e1004684.
Kim, Dong-Uk, Jacqueline Hayles, Dongsup Kim, Valerie Wood, Han-Oh Park, Misun Won, Hyang-Sook Yoo, et al. 2010. “Analysis of a Genome-Wide Set of Gene Deletions in the Fission Yeast Schizosaccharomyces pombe.” Nat. Biotechnol. 28 (6): 617–23.
Kruskal, William H, and W Allen Wallis. 1952. “Use of Ranks in One-Criterion Variance Analysis.” J. Am. Stat. Assoc. 47 (260): 583–621.
Lu, Shihua, and Bryan R Cullen. 2003. “Analysis of the Stimulatory Effect of Splicing on mRNA Production and Utilization in Mammalian Cells.” RNA 9 (5): 618–30.
Marguerat, Samuel, Alexander Schmidt, Sandra Codlin, Wei Chen, Ruedi Aebersold, and Jürg Bähler. 2012. “Quantitative Analysis of Fission Yeast Transcriptomes and Proteomes in Proliferating and Quiescent Cells.” Cell 151 (3): 671–83.
Matsuyama, Akihisa, Ritsuko Arai, Yoko Yashiroda, Atsuko Shirai, Ayako Kamata, Shigeko Sekido, Yumiko Kobayashi, et al. 2006. “ORFeome Cloning and Global Analysis of Protein Localization in the Fission Yeast Schizosaccharomyces pombe.” Nat. Biotechnol. 24 (7): 841–47.
Patil, Indrajeet. 2021. “Visualizations with statistical details: The ’ggstatsplot’ approach.” Journal of Open Source Software 6 (61): 3167. https://doi.org/10.21105/joss.03167.
Pearson, Karl. 1900. “X. On the Criterion That a Given System of Deviations from the Probable in the Case of a Correlated System of Variables Is Such That It Can Be Reasonably Supposed to Have Arisen from Random Sampling.” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 50 (302): 157–75. https://doi.org/10.1080/14786440009463897.
R Core Team. 2022. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Shyu, Ann-Bin, Miles F Wilkinson, and Ambro van Hoof. 2008. “Messenger RNA Regulation: To Translate or to Degrade.” EMBO J. 27 (3): 471–81.
Spearman, C. 1961. “The Proof and Measurement of Association Between Two Things.” In Studies in Individual Differences: The Search for Intelligence, edited by James J Jenkins, 774:45–58.
Wang, Hai-Fang, Liang Feng, and Deng-Ke Niu. 2007. “Relationship Between mRNA Stability and Intron Presence.” Biochem. Biophys. Res. Commun. 354 (1): 203–8.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wilcoxon, Frank. 1992. “Individual Comparisons by Ranking Methods.” In Breakthroughs in Statistics: Methodology and Distribution, edited by Samuel Kotz and Norman L Johnson, 196–202. New York, NY: Springer New York.